Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network

Speech enhancement (SE) reduces background noise in target speech and is applied at the front end of various real-world applications, including robust automatic speech recognition (ASR) and real-time processing in mobile phone communications. SE systems are commonly integrated into mobile phones to increase quality and intelligibility. As a result, a low-latency system is required to operate in real-world applications. At the same time, these systems need efficient optimization. This research focuses on single-microphone SE operating in real-time systems with better optimization. We propose a causal data-driven model that uses an attention encoder-decoder long short-term memory (LSTM) network to estimate the time-frequency mask from noisy speech and recover clean speech for real-time applications that need low-latency causal processing. The proposed model combines the encoder-decoder LSTM with a causal attention mechanism. Furthermore, a dynamical-weighted (DW) loss function is proposed to improve model learning by varying the loss weights. Experiments demonstrated that the proposed model consistently improves voice quality, intelligibility, and noise suppression. In the causal processing mode, the LSTM-based estimated suppression time-frequency mask outperforms the baseline model for unseen noise types. The proposed SE improved the STOI by 2.64% over the baseline LSTM-IRM, 6.6% over LSTM-KF, 4.18% over DeepXi-KF, and 3.58% over DeepResGRU-KF. In addition, we examine word error rates (WERs) using Google’s Automatic Speech Recognition (ASR). The ASR results show that error rates decreased from 46.33% (noisy signals) to 13.11% (proposed), 15.73% (LSTM), and 14.97% (LSTM-KF).


Introduction
Speech signals in many real-world situations are degraded by noise. A degraded signal severely influences the performance of many speech-related applications, such as automatic speech recognition [1], speaker identification [2], and hearing aid devices [3]. A speech enhancement system is mainly involved in restoring the quality and improving the intelligibility of signals degraded by noise, and is used at the front end of many speech applications to enhance their performance in noisy situations. Background noise, competing speakers, and room reverberation are the major sources of variations and distortions. An SE algorithm ideally ought to work acceptably in various acoustic situations; designing a broad-spectrum algorithm that performs well with little complexity and latency in every noisy situation is a challenging technical task. The conventional approaches include spectral subtraction [4], Wiener filtering [5], statistical models [6][7][8], and hybrid SE models [9,10]. They perform well in many stationary noises but face difficulties in handling nonstationary noises. In the recent past, deep learning has become the mainstream approach for speech enhancement [11]. Given a speech dataset of clean-noisy pairs, neural networks can learn to transform the noisy magnitude spectra to their clean counterparts (mapping-based) [12][13][14] or estimate time-frequency masks (masking-based) such as the ideal binary mask (IBM) [15,16], ideal ratio mask (IRM) [17,18], and spectral magnitude mask (SMM) [19]. Fully connected networks (FCN) [19], feedforward neural networks (FDNN) with Kalman filtering [20], recurrent neural networks (RNN) [21][22][23], and convolutional neural networks (CNN) [24,25] are important deep learning approaches in SE.
A fully connected feedforward neural network showed that a DNN trained on a large number of background noises with a single speaker generalized better to untrained noise types [26]. Such a network, however, has difficulty generalizing to both untrained speakers and noises when trained with a large number of speakers and noises. An RNN with LSTM has been used to design a noise- and speaker-independent network for speech enhancement: a four-layered RNN was trained on speech utterances from 77 different speakers combined with 10,000 different noise types [26]. Recently, SE has aimed to improve the performance of speaker- and noise-independent networks. In [25], a CNN with gated and dilated convolutions is proposed for magnitude enhancement. A recent trend is the use of attention mechanisms to improve the quality and intelligibility of noisy speech signals. In [27], a speech enhancement approach based on an attention LSTM is proposed, in which the forget gate is replaced with an attention gate. In [24], a dense CNN with self-attention is proposed to assist feature extraction using a combination of feature reuse. In [28], a dual-path self-attention RNN is proposed to improve the modeling of long sequences of speech frames. A number of deep learning studies based on the attention mechanism for SE have been proposed with promising results [29][30][31][32][33][34].
In most deep learning approaches, the mean-square error (MSE) is used as the loss function [15][16][17][18][19]. Other loss functions include the Huber loss and the mean absolute error (MAE). The gradient of the MAE remains constant during training even as the loss approaches zero, which can cause the optimizer to miss the minimum. Moreover, the Huber loss needs hyperparameter tuning, which brings further complexity when a loss function is dynamically weighted. A large error indicates poor learning on a particular instance in the dataset. A dynamically weighted loss function alters the learning process by increasing the weights assigned to large learning errors. Through such an amendment, the loss function focuses on the large learning errors and improves the network performance.
In this paper, an attention encoder-decoder LSTM network for sequence-to-sequence learning is proposed. The motivation behind this research is the recent success of the attention mechanism in speech emotion recognition [33] and speech recognition [34]. Deep learning approaches can be applied to regression or prediction tasks [35,36]. It is useful to employ the attention process in speech enhancement since a human can focus on a certain part of a speech stream, such as the target speech, with more attention while perceiving the surrounding noise with less attention. We have used an attention process on the encoder-decoder LSTM network, which has been shown to perform better in modeling vital sequential information. LSTM [37][38][39] can effectively learn the weights of the past input features and predict the enhanced frames. The attention process determines the correlations between the previous frames and the current frame to be enhanced and assigns weights to the previous frames. Experiments have shown that the proposed network consistently performed better in terms of speech quality and intelligibility. The overall structure of the proposed speech enhancement algorithm is depicted in Fig 1. The main contributions of this study are summarized as follows.
• For sequential learning to handle real-time speech applications that need low-latency causal processing, a causal speech enhancement based on attention encoder-decoder LSTM network is proposed.
• By adding weighted values for large learning errors, a dynamically weighted loss function is used to improve the learning process. The loss function focuses on the large learning errors to further improve the network performance.
• Automatic speech recognition is evaluated using the estimated magnitude, and the word error rate in noisy situations is notably reduced.
The remainder of this paper is organized as follows. In Section 2, we explain the proposed speech enhancement algorithm. The dynamically-weighted loss is presented in Section 3. The experimentation is presented in Section 4. The results and discussions are presented in Section 5. Finally, the conclusions are drawn in Section 6.

Proposed speech enhancement
For a given clean speech signal $x_t$ and noise signal $d_t$, the noisy speech signal $y_t$ is formed by additive mixing as follows:
$$y_t = x_t + d_t$$
where $\{x, y, d\} \in \mathbb{R}^{N \times 1}$ and $N$ is the number of speech frames. An SE algorithm aims to recover a close estimate $\hat{x}_t$ of the clean speech $x_t$ given $y_t$. The inputs to the LSTM encoder-decoder are $Y = [y_1, \ldots, y_t, \ldots, y_N]$, where $y_t$ indicates the spectral magnitudes of the noisy speech at frame $t$. The encoder extracts the high-level features $h$ from the input speech frames in the form of $h^K$ and $h^Q$, which stand for the key and query, respectively. In this study, a unidirectional LSTM is used as the encoder, since it shows a strong ability to model sequential data, leading to improved speech enhancement performance [39]. The attention process takes the key and query as inputs and creates fixed-length context vectors $C_t$. The decoder output $w_t$ is the recovered enhanced speech signal $\hat{x}_t$; the decoder takes as inputs the context vectors $C_t$, the encoder output $h^Q$, and the noisy speech $y_t$.
The proposed attention encoder-decoder LSTM is depicted in Fig 2.

Unidirectional LSTM encoder
The LSTM encoder extracts the high-level feature representations from the input speech frames. The input features are first fed into a fully-connected layer. The features $y_t$ are then the input to the LSTM cell, where $f(\cdot)$ denotes the LSTM function and $h^K_t$ is the LSTM output. The query $h^Q_t$ is likewise obtained from the encoder output.

Attention process
The attention process is fed with information about the key and query as inputs to create fixed-length context vectors. An attention process can, in general, use both previous and future speech frames. However, SE is a causal problem here and uses only previous speech frames to avoid processing latency. We have used causal dynamic and causal local attention approaches. To enhance a speech frame in causal dynamic attention, $Y = [y_1, \ldots, y_t]$ is used to compute the attention weights, which means that all the previous speech frames are used to enhance the current frame. If the duration of the speech utterance is long, the attention weights of many previous speech frames can be nearly zero. Therefore, in the causal local attention process, $Y = [y_{t-z}, \ldots, y_t]$ is used to compute the attention weights, where $z$ is set to a constant. The normalized attention weights $\kappa$ are learned over the frames $l, \ldots, t$, where $l = 1$ for causal dynamic attention and $l = t - z$ for causal local attention. The weights follow from a correlation computation between the key and query, and the context vector $C_t$ is then formed using the attention weights. The attention-weighted context vector determines how much each previous frame contributes to enhancing the current frame.
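The difference between the two causal attention variants can be illustrated with a minimal NumPy sketch. The source does not specify the correlation (score) function here, so a plain dot product between query and key with softmax normalization is assumed; the function name `causal_attention` and its arguments are hypothetical and not taken from the paper.

```python
import numpy as np

def causal_attention(keys, queries, z=None):
    """Sketch of causal attention over encoder outputs.

    keys, queries: arrays of shape (T, H) holding h^K and h^Q per frame.
    z=None  -> causal dynamic attention (frames 1..t are attended).
    z=int   -> causal local attention (frames t-z..t are attended).
    Returns the context vectors (T, H) and the attention weights (T, T).
    """
    num_frames = keys.shape[0]
    # ASSUMPTION: dot-product correlation between query t and key k.
    scores = queries @ keys.T
    t_idx = np.arange(num_frames)[:, None]
    k_idx = np.arange(num_frames)[None, :]
    allowed = k_idx <= t_idx                 # never look at future frames
    if z is not None:
        allowed &= k_idx >= t_idx - z        # restrict to the last z frames
    scores = np.where(allowed, scores, -np.inf)
    # Softmax over the allowed frames only (masked entries become 0).
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    contexts = weights @ keys                # weighted sum of the keys
    return contexts, weights
```

Calling `causal_attention(keys, queries, z=4)` restricts each frame to itself and its four predecessors, whereas omitting `z` lets every frame attend to the whole past of the utterance.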

Unidirectional LSTM decoder
The decoder recovers the enhanced output speech by using the input features, encoder output, and context vector. The enhanced vector $E_t$ is learned from $[C_t; h^Q_t]$, the concatenation of the context and feature vectors. The ideal ratio mask (IRM) is finally estimated from the feature vectors. The time-frequency IRM$(f, t)$ is given as:
$$\mathrm{IRM}(f, t) = \sqrt{\frac{|X(f, t)|^2}{|X(f, t)|^2 + |D(f, t)|^2}}$$
where $|X(f, t)|$ and $|D(f, t)|$ denote the magnitude spectra of the clean speech and noise signals, respectively. The enhanced vectors are multiplied with the noisy features, and the enhanced speech is recovered by taking the inverse short-time Fourier transform (STFT).
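As an illustration of the masking step, the following NumPy sketch computes the IRM training target from clean and noise magnitude spectra and applies a mask to noisy magnitudes. The function names and the small constant `eps` (added to avoid division by zero in silent bins) are assumptions introduced for this example.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-12):
    """IRM(f, t) = sqrt(|X|^2 / (|X|^2 + |D|^2)), element-wise.

    eps is a hypothetical safeguard against 0/0 in bins where both
    clean speech and noise are silent.
    """
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))

def apply_mask(mask, noisy_mag):
    """Scale each time-frequency bin of the noisy magnitude by the mask."""
    return mask * noisy_mag
```

The mask lies in [0, 1]: bins dominated by speech are passed almost unchanged, while bins dominated by noise are attenuated.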

Weighted loss function
In masking-based deep learning methods for SE, a loss function measures the divergence between the predefined and the estimated mask. A loss function aims to reduce the errors produced during training. Mostly, the mean square error (MSE) is used as the basic loss function, given as:
$$\mathrm{MSE} = \frac{1}{S} \sum_{i=1}^{S} (m_i - n_i)^2 \quad (13)$$
where $m$ and $n$ denote the input and the predicted value, respectively, and $S$ is the number of instances. Eq (13) can be represented in terms of the time-frequency mask as:
$$\mathrm{MSE} = \frac{1}{S} \sum_{i=1}^{S} \left( M(x_i) - \hat{M}(x_i) \right)^2 \quad (14)$$
where $\hat{M}(x)$ and $M(x)$ denote the estimated and predefined IRM masks, respectively. The dynamical-weight loss function adjusts the network learning by multiplying the learning errors by corresponding weights. Thus, the loss function focuses on large errors to improve performance. The MSE loss function is multiplied by a weight variable $O$ to get the weighted MSE as:
$$\mathrm{MSE}_{w} = \frac{1}{S} \sum_{i=1}^{S} O_i \left( M(x_i) - \hat{M}(x_i) \right)^2 \quad (15)$$
To emphasize the instances with large errors, the weight variable $O$ in Eq (15) is updated according to the learning error. The weight is selected by a condition on the absolute divergence, where $|\cdot|$ indicates the magnitude of the difference between the ground truth and estimated masks. The weighting is halved when the absolute divergence is less than a constant $B$, which is set to 10 since the model was observed to perform better with this setting.
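A minimal NumPy sketch of such a weighted loss is given below. The text only states that the weighting is halved when the absolute divergence falls below the constant B; the weight of 1.0 for all other instances, the function name, and the argument names are assumptions made for illustration, not the exact rule of the paper.

```python
import numpy as np

def dynamically_weighted_mse(target_mask, estimated_mask, threshold=10.0):
    """Weighted MSE between the predefined and the estimated mask.

    threshold plays the role of the constant B (10 in the text).
    ASSUMPTION: the weight is 0.5 below the threshold and 1.0 otherwise.
    """
    abs_err = np.abs(target_mask - estimated_mask)
    weights = np.where(abs_err < threshold, 0.5, 1.0)
    return float(np.mean(weights * abs_err ** 2))
```

With this rule, instances whose error stays below the threshold contribute half as much to the loss, so the gradient is dominated by the instances with large errors. The threshold is left as a parameter so that other values can be tried.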

Data generation
In the experiments, we have used the IEEE dataset [40]. Two IEEE datasets are used, composed of male speakers and female speakers, respectively. The noise sources are selected from AURORA [41]. For a given speech dataset $D$, we have $M_{tr}$ and $M_{te}$ as the training and testing speech utterances. The training and testing speech utterances in the dataset are denoted by $D_{tr}$ and $D_{te}$, respectively. The noisy utterances are generated by adding the noise signals to $D_{tr}$ and $D_{te}$.
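The generation of a noisy utterance can be sketched as follows. The text states only that noise is added to the clean utterances; scaling the noise so that the mixture reaches a requested SNR is a common procedure assumed here, and the function name `mix_at_snr` is hypothetical.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean utterance at the requested SNR (in dB).

    ASSUMPTION: the noise recording is at least as long as the clean
    utterance and is truncated to the same length before mixing.
    """
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that clean_power / (gain^2 * noise_power)
    # equals 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

For example, `mix_at_snr(clean, noise, -5)` produces a mixture in which the noise power is about 3.16 times the speech power.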

Feature extraction
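The input signals $y$, $x$, and $d$ are transformed from the time domain to the frequency domain using the STFT as $Y = \mathrm{STFT}(y)$, $X = \mathrm{STFT}(x)$, and $D = \mathrm{STFT}(d)$, where $\{X, Y, D\} \in \mathbb{C}^{T \times F}$, and $T$ and $F$ denote the number of frames and the number of frequency bins, respectively. We have used the STFT magnitude $|Y|$ as the input features.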

Experimental setup
We have used speech utterances with a 16 kHz sampling rate. A 512-point Hanning window with 75% overlap is used. We used the noisy phase during waveform reconstruction. The network consists of an input layer and three unidirectional LSTM layers with 256 memory units each, followed by a fully connected output layer with 257 sigmoidal units. The number of epochs and the learning rate are set to 160 and 0.001, respectively. The weights are randomly initialized and trained in mini-batches of 32 sequences by back-propagation through time with the Adam optimizer. The three-layered LSTM network architecture with (128/256/256/256/257) memory cells is used. The details of the hyperparameters are given in Table 1. To create the noisy utterances, three SNR levels are used, from -5dB to +5dB with a 5dB step size. For network training, IEEE speech utterances from the male and female speakers are duplicated three times for each SNR level and mixed with all noise types. Therefore, a total of 21,600 speech utterances (approx. 18 hours) are used in the training process. We also used half-speech utterances during testing in the matched and mismatched conditions. During testing, each noise type is tested with a different set of utterances.
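The feature extraction with these settings can be sketched in NumPy as follows. A 512-point window with 75% overlap gives a hop of 128 samples, and the one-sided spectrum has 257 bins, matching the 257 output units. The function name, the absence of padding, and the use of `np.hanning` (a symmetric window) are assumptions of this sketch.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, overlap=0.75):
    """Return magnitude and phase spectra, each of shape (frames, n_fft//2 + 1).

    ASSUMPTION: no padding is applied, so the signal must contain at least
    n_fft samples and trailing samples that do not fill a frame are dropped.
    """
    hop = int(n_fft * (1 - overlap))          # 128 samples for 75% overlap
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)    # one-sided spectrum, 257 bins
    return np.abs(spectrum), np.angle(spectrum)
```

The magnitude serves as the network input, while the phase of the noisy signal is kept aside for waveform reconstruction.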

Results and discussions
To assess the performance of the proposed speech enhancement method, we have compared the results with the baseline LSTM-IRM [26], DNN-IRM [18], LMMSE [6], OM-LSA [7], FDNN-KF [20], LSTM-KF [44], DeepXi-KF [45], and DeepResGRU-KF [46]. For matched and mismatched conditions, the average STOI, PESQ, and SDR test values across several noise sources and SNRs are given in Tables 2-5. Note that, in contrast to the competing deep-learning methods, the proposed DWAtten-LSTM presents the highest performance in terms of the STOI, PESQ, and SDR values in noisy situations. Two conditions, matched and mismatched, are considered in the experiments. In matched conditions, the speakers and utterances remain the same in the training and testing sets, whereas in mismatched conditions, the speakers and utterances in the training set are different from those in the testing set.
In the matched conditions (Tables 2-4), the proposed DWAtten-LSTM approach achieved the best STOI, PESQ, and SDR values with airport noise, street noise, and car noise at SNR ≥ 5dB, that is, STOI ≥ 94%, PESQ ≥ 2.99, and SDR ≥ 10.8dB. The STOI with babble noise is improved from 68.1% with noisy speech signals to 85.2% with DWAtten-LSTM, a 17.1% improvement in STOI at 0dB SNR. Similarly, the PESQ with airport noise is improved from 1.93 with DNN-IRM to 2.29 with the proposed DWAtten-LSTM, a gain of 0.36 (18.75%) at -5dB SNR. Moreover, the SDR with car noise is improved from 4.20dB with LSTM-IRM to 4.56dB with DWAtten-LSTM, a 0.36dB (8.57%) improvement at the -5dB SNR level. Note from Table 2 (matched condition) that DWAtten-LSTM presents the highest STOI, PESQ, and SDR values in all noisy situations as compared to the baseline LSTM with the same network architecture.
The average STOI, PESQ, and SDR test values across all noise sources and SNRs are given in Table 5 for mismatched conditions, where the proposed DWAtten-LSTM achieved the best STOI, PESQ, and SDR values with airport noise at SNR ≥ 5dB, that is, STOI ≥ 92.3%, PESQ ≥ 2.86, and SDR ≥ 10.7dB. The STOI with factory noise is improved from 55.2% with noisy speech signals to 77.0% with DWAtten-LSTM, a 21.8% STOI gain at -5dB SNR. Fig 3 shows the STOI, PESQ, and SDR improvements in various background noises, where the performance of the proposed DWAtten-LSTM can be seen for individual noises at -5dB, 0dB, and 5dB SNRs.
We also compared DWAtten-LSTM to the non-deep-learning-based LMMSE and OM-LSA. The overall average STOI, PESQ, and SDR values for all noise sources are given in Table 6 for matched conditions (denoted as Proposed-M), mismatched conditions (denoted as Proposed-UM), and the average of both (denoted as Proposed-Avg). PESQ and STOI values are calculated for the causal local attention process, and the value of z is varied from 4 to 12 with an increment of 4. Table 7 shows the results. It is noticed that values greater than 12 for z result in no further improvements and the best performance is achieved for z = 4. As compared to causal dynamic attention, causal local attention showed better results. The observations support the inference that extensive previous information is not required in speech enhancement. This inference is logical since noisy situations, in both noise type and SNR, change over time. The observations are valid for the attention networks since the attention LSTM outperformed the baseline LSTM.

PLOS ONE
Table 8 shows the comparison between the loss errors and predicted results of the DWAtten-LSTM with and without the DW loss function. The DW loss function improved the predicted scores (STOI and PESQ). The weighted MSE yields a lower error ($3.42 \times 10^{-4}$) than the non-weighted MSE ($3.54 \times 10^{-4}$).
To understand the attention process, the attention maps are illustrated in Fig 5. The x-axis denotes $h^K$ and the y-axis denotes $h^Q$. The points $(x, y)$ denote the attention weights. The attention-based network assigns different attention levels (weights) to the contextual frames. The top spectrogram shows noisy speech, and the other spectrogram shows clean speech. In the experiments, time-varying spectral analysis is conducted to showcase the performance of DWAtten-LSTM. Fig 6 demonstrates the sample spectrogram analysis. A clean speech utterance is mixed with babble noise at 5 dB. The spectrogram of DWAtten-LSTM is plotted in Fig 6(F). The harmonic structures of the vowels and the formant peaks are well retained. Moreover, the spectrogram showed excellent structure during speech activity. During speech pauses, DWAtten-LSTM removed the residual noise signals. The weak harmonic structures in the high-frequency sub-bands are well maintained. Thus, a better quality of the enhanced speech is achieved by DWAtten-LSTM. The weak energy in the speech utterance is also well retained, which yields less speech distortion. Therefore, the intelligibility of noisy speech is improved. The residual noise signals are evident in the spectrograms of LMMSE and OM-LSA, plotted in Fig 6(C) and 6(D).
The complexity and convergence analysis are also given. The complexity of a deep learning algorithm revolves around the number of training parameters; the LSTM networks have 1.2 million parameters. This is considerably fewer than in other networks used for speech enhancement; for example, 10 million parameters are used by the residual LSTM [47]. This also significantly reduces the training time and speeds up the process. DWAtten-LSTM took less time per epoch compared to the residual LSTM (using an NVIDIA GTX 950 Ti GPU). Next, we observed the convergence of the weighted MSE between the estimated and true values for the training and testing data sets of DWAtten-LSTM. The MSE decreased after every epoch until converging at around epoch 155.
According to STOI, PESQ, and SDR, the following inferences are drawn. Under various noisy situations, the PESQ, STOI, and SDR values indicate that DWAtten-LSTM achieved the best improvements in quality (PESQi), intelligibility (STOIi), and speech distortion (SDRi) as compared to the competing deep learning and non-deep learning methods. The proposed DWAtten-LSTM method improved the quality without degrading speech intelligibility in noisy situations. All deep-learning methods showed consistent improvements in STOI and SDR values, which suggests the potential of deep learning for speech enhancement tasks.
ASR systems use the magnitude spectrum of speech signals, so deep learning approaches can be expected to improve ASR performance in noisy situations. For ASR systems, SE algorithms operate at the front end. We have used Google ASR [48] to examine the ASR performance in terms of the WERs. The average WERs are given in Table 9.
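For reference, the WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the ASR output, divided by the number of reference words. The sketch below follows this standard definition; it is not the scoring tool used in the experiments, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words.

    ASSUMPTION: the reference contains at least one word and both
    transcripts are already normalized (case, punctuation).
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # previous[j] holds the distance between the processed reference
    # prefix and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

For an eight-word reference in which the recognizer drops one word, the function returns 1/8 = 0.125, that is, a WER of 12.5%.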

Subjective evaluation
In addition, we have conducted subjective listening tests to assess the perceptual quality of enhanced speech. The enhanced speech utterances are randomly chosen from various noise sources (airport, babble, factory, and restaurant) using three SNRs, which are -5 dB, 0 dB, and 5 dB. In total, 300 speech utterances are used to assess DNN, LSTM, and the proposed SE. The participants are requested to assign a score (from 0 to 5) according to perceived speech quality. During experiments, no speech utterance is repeated. The listening tests are conducted in an isolated room using high-quality headphones. The data of the listeners who participated in the subjective listening tests for speech quality are given in Table 10. Prior training sessions are arranged to educate the listeners about the procedures.

Speech dereverberation
This section examines the dereverberation performance of the proposed SE. To train the SE model, three reverberation times (0.4 sec, 0.6 sec, and 0.8 sec) are considered. A total of 100 anechoic speech utterances from the IEEE dataset [40] are used to create the training dataset. The testing dataset contains 40 reverberant speech utterances. There is no overlap between the speech utterances used during model training and testing. The proposed method is evaluated on reverberant speech utterances, and the results are compared with the study of Wu and Wang [49], where estimated inverse filters and spectral subtraction are used to reduce reverberation. Table 11 shows the results using STOI and PESQ. The proposed method delivered the best STOI and PESQ scores, i.e., STOI ≥ 78.3% and PESQ ≥ 2.45 at RT ≥ 4 sec. The spectrograms are provided in Fig 8, where the smearing energy produced by reverberation is considerably reduced, showing that the proposed method improves dereverberation performance.

Conclusions
In this paper, we have proposed a monaural SE based on the attention LSTM encoder-decoder model with a novel loss function. The proposed DWAtten-LSTM estimated the magnitude spectrum from the noisy speech signals using an ideal ratio mask. We have compared this model to the baseline and to competing deep learning and non-deep-learning methods in terms of speech intelligibility and quality. The objective assessments are carried out in various noisy situations using three input SNR levels. The PESQ and SDR values indicated that the proposed DWAtten-LSTM achieved significant gains of 0.79 (45.93%) and 6.96dB over noisy speech. Similarly, STOI indicated that DWAtten-LSTM preserved intelligibility in all noisy situations, achieving a large gain of 16.70% over the noisy speech. The subjective analysis confirmed the success of the proposed model in terms of speech quality. The results and analysis show that better speech quality and intelligibility are achieved with the proposed DWAtten-LSTM. The attention process observations support the inference that extensive previous information is not vital in speech enhancement. The proposed loss function significantly improved the model learning. Although deep learning methods for speech enhancement outperform the conventional methods, they rely on complex network architectures; less computationally complex and more efficient network architectures are still required for improved performance. The proposed DWAtten-LSTM SE algorithm has demonstrated considerable performance gains as compared to the baseline LSTM and FDNN and achieved higher performance gains when compared to the conventional SE.
Our future research will focus on further improving the quality and intelligibility under intense unseen noises and unseen speakers by proposing computationally less complex network architectures. Moreover, phase estimation will also be included to increase the speech quality. This study used the STFT as the transformation tool for frequency-domain representation; however, several other transformations are available in the literature. In future studies, these transformations [50][51][52][53] will be used for more in-depth analysis.